Scalability of the SAS/STAT HPGENSELECT High-Performance Analytical Procedure: A comparison with RevoScaleR
Not recorded
Abstract
Effectively implementing high-performance analytics software solutions in the insurance industry

Executive Summary

At the Strata Conference on October 25, 2012, the research and planning division of a large insurance corporation (hereafter "insurer") presented various methods that they used to model 150 million observations of insurance data. A summary of their presentation is available at:

The methods were:
• The GENMOD procedure in SAS/STAT® software
• Custom MapReduce code on a Hadoop cluster
• Open-source R
• Revolution R Enterprise using RevoScaleR

The insurer reported that PROC GENMOD took more than five hours to fit a Poisson regression model, whereas Revolution Analytics used the RevoScaleR package to fit the model in 5.7 minutes. However, this is not an "apples to apples" comparison because RevoScaleR was run on a cluster of computers, whereas the GENMOD procedure executes only on a single server. A more informative comparison can be made by using the HPGENSELECT procedure, which had not yet been released at the time of the Strata comparison. Introduced in SAS/STAT 12.3 in June 2013, PROC HPGENSELECT runs in either single-machine mode (multiple threads on a single machine) or distributed mode (multiple threads on multiple machines). Distributed mode requires SAS® High-Performance Statistics.

Purpose

This paper compares the performance of the HPGENSELECT procedure with results cited for the RevoScaleR package by using data that are similar to the insurer's data. The paper also demonstrates the scalability of the HPGENSELECT procedure by using two sizes of data sets and three different computing environments.

Results

On a small grid with two nodes, the HPGENSELECT procedure fits a Poisson regression model with 150 million observations in 159 seconds, which is less than half the time that RevoScaleR required on a somewhat larger grid. On a grid with 140 nodes, the HPGENSELECT procedure solves the problem in 22 seconds.
The scalability of the HPGENSELECT procedure is demonstrated by increasing the size of the data set. For a data set that has the same variables and one billion observations, the procedure executes in less than one minute. These results, which are summarized graphically in Figure 1, show that the HPGENSELECT procedure provides a faster alternative.
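The timing claims in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses just the figures quoted above, and the variable and function names are hypothetical. The values for PROC GENMOD ("more than five hours") and for the one-billion-observation run ("less than one minute") are bounds rather than measurements, so the derived numbers are bounds as well.

```python
# Timings quoted in the abstract, converted to seconds.
GENMOD_SECONDS_MIN = 5 * 60 * 60      # "more than five hours": a lower bound
REVOSCALER_SECONDS = 5.7 * 60         # 5.7 minutes reported for RevoScaleR
HPGENSELECT_2_NODES = 159.0           # two-node grid, 150 million observations
HPGENSELECT_140_NODES = 22.0          # 140-node grid, same problem
OBSERVATIONS = 150_000_000
BILLION_RUN_SECONDS_MAX = 60.0        # "less than one minute": an upper bound


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second timing is than the first."""
    return slow_seconds / fast_seconds


def throughput(n_obs, seconds):
    """Return observations processed per second."""
    return n_obs / seconds


# "Less than half the time that RevoScaleR required": 159 s versus 342 s.
half_of_revoscaler = REVOSCALER_SECONDS / 2
print(f"RevoScaleR: {REVOSCALER_SECONDS:.0f} s, half of that: {half_of_revoscaler:.0f} s")
print(f"HPGENSELECT on 2 nodes beats half: {HPGENSELECT_2_NODES < half_of_revoscaler}")

# Ratio between the two-node and 140-node runs of the same problem.
print(f"2 nodes vs 140 nodes: {speedup(HPGENSELECT_2_NODES, HPGENSELECT_140_NODES):.1f}x")

# Throughput on the 140-node grid, and a lower bound for the one-billion run.
print(f"140 nodes: {throughput(OBSERVATIONS, HPGENSELECT_140_NODES):,.0f} obs/s")
print(f"1 billion obs: more than {throughput(1_000_000_000, BILLION_RUN_SECONDS_MAX):,.0f} obs/s")
```

Going from 2 to 140 nodes multiplies the node count by 70 while the run time drops by a factor of about 7, so the 150-million-observation problem does not scale linearly with node count at that size. The abstract does not break the timings down further.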
Similar resources
SUGI 27: SAS(r) Meets Big Iron: High Performance Computing in SAS(r) Analytical Procedures
Version 9 targets the heavy-duty analytic procedures in SAS® for high performance computing enhancements. These enhancements encompass both algorithmic improvements and modifications to exploit multiprocessor hardware. This paper provides a survey of this development and the performance gains obtained in several procedures in SAS/STAT and Enterprise Miner. Some general scalability issues are ...
SUGI 27: Up and Out: Where We're Going with Scalability in SAS(r) Version 9
This paper gives an overview of the ways that SAS is addressing performance through scalability in SAS Version 9. Scalability features have been implemented in many areas of SAS Version 9 to allow your applications to scale up and scale out. These include: • Multi-Process (MP) CONNECT, • the Scalable Performance Data Engine (SPDE engine), • certain SAS/ACCESS engines, • several scalable SAS pro...
The RANDOM Statement and More: Moving On with PROC MCMC
The MCMC procedure, first released in SAS/STAT® 9.2, provides a flexible environment for fitting a wide range of Bayesian statistical models. Key enhancements in SAS/STAT 9.22 and 9.3 offer additional functionality and improved performance. The RANDOM statement provides a convenient way to specify linear and nonlinear random-effects models along with substantially improved performance. The MCMC...
Analyzing a Regression Model with a General Positive Definite Covariance Matrix with The SAS System
This article discusses and proposes a procedure for the analysis of the univariate linear regression model with known general positive definite covariance matrix with SAS/STAT software of the SAS System. Estimation of parameters, hypothesis testing, estimation under constraints and collinearity and influence diagnostics are reviewed. An example is given to illustrate the procedure.
Improving LoRaWAN Performance Using Reservation ALOHA
LoRaWAN is one of the new and updated standards for IoT applications. However, the expected high density of peripheral devices for each gateway and the absence of an operative synchronization mechanism between the gateway and peripherals challenge the network's scalability. In this paper, we propose to normalize the communication of LoRaWAN networks using a Reservation-ALOHA (R-A...